A simple alphabet-independent FM-index
Authors
Abstract
We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, σ, present in that structure. On a text of length n with zero-order entropy H0, our index needs O(n(H0 + 1)) bits of space, without any dependence on σ. The average search time for a pattern of length m is O(m(H0 + 1)), under reasonable assumptions. Each position of a text occurrence can be reported in worst-case time O((H0 + 1) log n), while any text substring of length L can be retrieved in O((H0 + 1)L) average time in addition to the previous worst-case time. Our index provides a relevant space/time tradeoff between existing succinct data structures, with the additional interest of being easy to implement. Our experimental results show that, although not among the most succinct, our index is faster than the others in many aspects, even when they are allowed to use significantly more space.
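The pipeline described in the abstract (Huffman-compress the text, apply the Burrows-Wheeler transform to the resulting bit stream, then search it as an FM-index) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `huffman_codes`, `build_index`, and `locate` are hypothetical, the suffix array is built naively and stored in full instead of being sampled, and codeword boundaries are kept in a plain dictionary rather than a succinct bit stream. It also assumes a non-empty text.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return {symbol: bit string} for a Huffman code built from text."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, unique tiebreak, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def build_index(text):
    """Huffman-encode text, then build BWT-based search tables over the bits."""
    codes = huffman_codes(text)
    bits = "".join(codes[ch] for ch in text)
    # Map each codeword start (bit offset) to its character position.
    starts, pos = {}, 0
    for k, ch in enumerate(text):
        starts[pos] = k
        pos += len(codes[ch])
    s = bits + "$"  # '$' sorts before '0' and '1' in ASCII
    # Naive suffix array, for illustration only (quadratic space and worse time).
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)
    # C[c]: number of symbols in s that are smaller than c.
    C = {"0": 1, "1": 1 + bits.count("0")}
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {"0": [0], "1": [0]}
    for b in bwt:
        for c in "01":
            occ[c].append(occ[c][-1] + (b == c))
    return codes, sa, C, occ, starts


def locate(index, pattern):
    """Return sorted text positions where pattern occurs."""
    codes, sa, C, occ, starts = index
    if any(ch not in codes for ch in pattern):
        return []
    pbits = "".join(codes[ch] for ch in pattern)
    lo, hi = 0, len(sa)  # half-open range of suffixes
    for c in reversed(pbits):  # FM-index backward search, one bit per step
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    # Discard bit-level matches that do not start on a codeword boundary.
    return sorted(starts[sa[i]] for i in range(lo, hi) if sa[i] in starts)


index = build_index("abracadabra")
print(locate(index, "abra"))  # [0, 7]
```

Because the indexed sequence is binary, the search tables need only two symbols regardless of σ, and the number of backward-search steps equals the length of the Huffman-encoded pattern. The boundary filter in `locate` is needed because a bit pattern can also match at offsets that fall in the middle of a codeword.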
Similar resources
First Huffman, Then Burrows-Wheeler: A Simple Alphabet-Independent FM-Index
Main Results. The basic string matching problem is to determine the occurrences of a short pattern P = p1p2...pm in a large text T = t1t2...tn, over an alphabet of size σ. Indexes are structures built on the text to speed up searches, but they used to take up much space. In recent years, succinct text indexes have appeared. A prominent example is the FM-index [2], which takes little spa...
FM-KZ: An even simpler alphabet-independent FM-index
In an earlier work [6] we presented a simple FM-index variant, based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The main drawback of using Huffman was its lack of synchronizing properties, forcing us to supply another bit stream indicating the Huffman codeword boundaries. In this way, the resulting index needed O(n(H0+1)) bits of space b...
An Efficient Composite-Alphabet Transform for String Matching under a Restricted Alphabet Set
String matching is a problem of finding all occurrences of a short pattern on a relatively long reference string. While a number of methods have been presented, most published implementations assume several restrictions due to some practical issues. We focus on the restriction of the alphabet size, which is usually set to be 256 in many string matching libraries. When strings must be handled ov...
List of Contributions: The Pre-history and Future of the Block-Sorting Compression Algorithm
The FM-index is a succinct text index needing only O(Hkn) bits of space, where n is the text size and Hk is the kth order entropy of the text. FM-index assumes constant alphabet; it uses exponential space in the alphabet size, σ. In this paper we show how the same ideas can be used to obtain an index needing O(Hkn) bits of space, with the constant factor depending only logarithmically on σ. Our...
An Alphabet-Friendly FM-Index
We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet Σ. The size of the new index built on a string T[1, n] is bounded by nHk(T) + O((n log log n) / log_|Σ| n) bits, where Hk(T) is the k-th order empirical entropy of T. The above bound h...
Journal:
- Int. J. Found. Comput. Sci.
Volume 17, Issue
Pages -
Publication date 2005